I, Paul Polakis, Ph.D., declare and say as follows:

1. I was awarded a Ph.D. by the Department of Biochemistry of the Michigan
State University in 1984. My scientific Curriculum Vitae is attached to and forms
part of this Declaration (Exhibit A). 

2. I am currently employed by Genentech, Inc. where my job title is Staff 
Scientist. Since joining Genentech in 1999, one of my primary responsibilities has
been leading Genentech's Tumor Antigen Project, which is a large research project 
with a primary focus on identifying tumor cell markers that find use as targets for 
both the diagnosis and treatment of cancer in humans. 

3. As part of the Tumor Antigen Project, my laboratory has been analyzing 
differential expression of various genes in tumor cells relative to normal cells. 
The purpose of this research is to identify proteins that are abundantly expressed 
on certain tumor cells and that are either (i) not expressed, or (ii) expressed at 
lower levels, on corresponding normal cells. We call such differentially expressed 
proteins "tumor antigen proteins". When such a tumor antigen protein is
identified, one can produce an antibody that recognizes and binds to that protein. 
Such an antibody finds use in the diagnosis of human cancer and may ultimately 
serve as an effective therapeutic in the treatment of human cancer. 

4. In the course of the research conducted by Genentech's Tumor Antigen 
Project, we have employed a variety of scientific techniques for detecting and 
studying differential gene expression in human tumor cells relative to normal cells, 
at genomic DNA, mRNA and protein levels. An important example of one such 
technique is the well known and widely used technique of microarray analysis
which has proven to be extremely useful for the identification of mRNA molecules
that are differentially expressed in one tissue or cell type relative to another. In the 
course of our research using microarray analysis, we have identified 
approximately 200 gene transcripts that are present in human tumor cells at 
significantly higher levels than in corresponding normal human cells. To date, we 
have generated antibodies that bind to about 30 of the tumor antigen proteins 
expressed from these differentially expressed gene transcripts and have used these 
antibodies to quantitatively determine the level of production of these tumor 
antigen proteins in both human cancer cells and corresponding normal cells. We 
have then compared the levels of mRNA and protein in both the tumor and normal 
cells analyzed. 

5. From the mRNA and protein expression analyses described in paragraph 4
above, we have observed that there is a strong correlation between changes in the
level of mRNA present in any particular cell type and the level of protein
expressed from that mRNA in that cell type. In approximately 80% of our
observations we have found that increases in the level of a particular mRNA 
correlates with changes in the level of protein expressed from that mRNA when 
human tumor cells are compared with their corresponding normal cells. 

6. Based upon my own experience accumulated in more than 20 years of 
research, including the data discussed in paragraphs 4 and 5 above and my 
knowledge of the relevant scientific literature, it is my considered scientific 
opinion that for human genes, an increased level of mRNA in a tumor cell relative 
to a normal cell typically correlates to a similar increase in abundance of the 
encoded protein in the tumor cell relative to the normal cell. In fact, it remains a
central dogma in molecular biology that increased mRNA levels are predictive of 
corresponding increased levels of the encoded protein. While there have been 
published reports of genes for which such a correlation does not exist, it is my 
opinion that such reports are exceptions to the commonly understood general rule 
that increased mRNA levels are predictive of corresponding increased levels of the 
encoded protein. 

7. I hereby declare that all statements made herein of my own knowledge are
true and that all statements made on information or belief are believed to be true, 
and further that these statements were made with the knowledge that willful false 
statements and the like so made are punishable by fine or imprisonment, or both, 
under Section 1001 of Title 18 of the United States Code and that such willful 
statements may jeopardize the validity of the application or any patent issued 
thereon. 

Experimental genomics in combination with the growing body of sequence information promise to
revolutionize the way cells and cellular processes are studied. Information on genomic sequence can be used
experimentally with high-density DNA arrays that allow complex mixtures of RNA and DNA to be interrogated
in a parallel and quantitative fashion. DNA arrays can be used for many different purposes, most prominently
to measure levels of gene expression (messenger RNA abundance) for tens of thousands of genes
simultaneously. Measurements of gene expression and other applications of arrays embody much of what is
implied by the term 'genomics'; they are broad in scope, large in scale, and take advantage of all available
sequence information for experimental design and data interpretation in pursuit of biological understanding.



Biological and biomedical research is in the 
midst of a significant transition that is being 
driven by two primary factors: the massive 
increase in the amount of DNA sequence 
information and the development of 
technologies to exploit its use. Consequently, we find 
ourselves at a time when new types of experiments are 
possible, and observations, analyses and discoveries are 
being made on an unprecedented scale. Over the past few 
years, more than 30 organisms have had their genomes 
completely sequenced, with another 100 or so in progress
(see www.tigr.org or genomes@ncbi.nlm.nih.gov for 
a list). At least partial sequence has been obtained for 
tens of thousands of mouse, rat and human genes, and 
the sequence of two entire human chromosomes 
(chromosomes 21 and 22) has been determined. Within
the year, a large proportion of the human genome will be 
deciphered, in both public and private efforts, and the 
complete sequence of the mouse and other animal and 
plant genomes will undoubtedly follow close behind. 
Unfortunately, the billions of bases of DNA sequence do 
not tell us what all the genes do, how cells work, how cells
form organisms, what goes wrong in disease, how we age 
or how to develop a drug. This is where functional 
genomics comes into play. The purpose of genomics is to 
understand biology, not simply to identify the component 
parts, and the experimental and computational methods 
take advantage of as much sequence information as 
possible. In this sense, functional genomics is less a specific 
project or programme than it is a mindset and general 
approach to problems. The goal is not simply to provide a 
catalogue of all the genes and information about their 
functions, but to understand how the components work 
together to comprise functioning cells and organisms.

To take full advantage of the large and rapidly increasing
body of sequence information, new technologies are
required. Among the most powerful and versatile tools for
genomics are high-density arrays of oligonucleotides or
complementary DNAs. Nucleic acid arrays work by hybridization
of labelled RNA or DNA in solution to DNA molecules
attached at specific locations on a surface. The hybridization
of a sample to an array is, in effect, a highly parallel search by
each molecule for a matching partner on an 'affinity matrix',
with the eventual pairings of molecules on the surface
determined by the rules of molecular recognition. Arrays of
nucleic acids have been used for biological experiments for
many years. Traditionally, the arrays consisted of fragments
of DNA, often with unknown sequence, spotted on a porous
membrane (usually nylon). The arrayed DNA fragments
often came from cDNA, genomic DNA or plasmid libraries,
and the hybridized material was often labelled with a radioactive
group. Recently, the use of glass as a substrate and fluorescence
for detection, together with the development of new
technologies for synthesizing or depositing nucleic acids on
glass slides at very high densities, have allowed the miniaturization
of nucleic acid arrays with concomitant increases in
experimental efficiency and information content (Fig. 1).

While making arrays with more than several hundred 
elements was until recently a significant technical 
achievement, arrays with more than 250,000 different 
oligonucleotide probes or 10,000 different cDNAs per 
square centimetre can now be produced in significant 
numbers. Although it is possible to synthesize or deposit
DNA fragments of unknown sequence, the most common
implementation is to design arrays based on specific
sequence information, a process sometimes referred to as
'downloading the genome onto a chip' (Fig. 1). There are
several variations on this basic technical theme: the
hybridization reaction may be driven (for example, by an
electric field); other detection methods besides fluorescence
can be used; and the surface may be made of materials
other than glass such as plastic, silicon, gold, a gel or
membrane, or may even be comprised of beads at the ends of
fibre-optic bundles. Nonetheless, the key elements of
parallel hybridization to localized, surface-bound nucleic 
acid probes and subsequent counting of bound molecules 
are ubiquitous, and high-density arrays of nucleic acids on 
glass (often called DNA microarrays, oligonucleotide 
arrays, GeneChip arrays, or simply 'chips') and their 
biological uses will be the focus of this review. 

Global gene expression experiments 

One of the most important applications for arrays so far is the 
monitoring of gene expression (mRNA abundance). The col- 
lection of genes that are expressed or transcribed from 
genomic DNA, sometimes referred to as the expression 

Figure 1 Principal types of arrays used in gene expression monitoring. Nucleic acid
arrays are generally produced in one of two ways: by robotic deposition of nucleic acids
(PCR products, plasmids or oligonucleotides) onto a glass slide or in situ synthesis
(using photolithography) of oligonucleotides. Shown are pseudocolour images of:
a, an oligonucleotide array and b, a cDNA array after hybridization of labelled samples
and fluorescence detection. In both cases the images have been coloured to indicate
the relative number of yeast transcripts present under two different growth conditions
(red, high in condition 1, low in condition 2; green, high in condition 2, low in condition
1; yellow, high under both conditions; black, low under both conditions). In the case of
photolithographically synthesized arrays, ~10^7 copies of each selected oligonucleotide
(usually 20 to 25 nucleotides in length) are synthesized base by base in hundreds of
thousands of different 24 µm x 24 µm areas on a 1.28 cm x 1.28 cm glass
surface. For robotic deposition, approximately one nanogram of material is deposited
at intervals of 100-300 µm. Typically for oligonucleotide arrays, multiple probes per
gene are placed on the array (20 pairs in the example shown here), while in the case of
robotic deposition, a single, longer (up to 1,000 bp) double-stranded DNA probe is
used for each gene or EST. In both cases, probes are usually designed from sequence
located nearer to the 3' end of the gene (near the poly-A tail in eukaryotic mRNA), and
different probes can be used for different exons. After hybridization of labelled samples
(typically overnight), the arrays are scanned and the quantitative fluorescence image
along with the known identity of the probes is used to assess the 'presence' or
'absence' (more precisely, the detectability above thresholds based on background and
noise levels) of a particular molecule (such as a transcript), and its relative abundance
in one or more samples. Because the sequence of the oligonucleotide or cDNA at each
physical location (or address) is generally known or can be determined, and because
the recognition rules that govern hybridization are well understood, the signal intensity
at each position gives not only a measure of the number of molecules bound, but also
the likely identity of the molecules. Although oligonucleotide probes vary systematically
in their hybridization efficiency, quantitative estimates of the number of transcripts per
cell can be obtained directly by averaging [...]. For
technical reasons, the information obtained from spotted cDNA arrays gives the relative
[...] competitive, two-colour hybridization. Messenger RNAs present at a few copies
(relative abundance of ~1:100,000 or less) to thousands of copies per mammalian cell
can be detected, and changes as subtle as a factor of 1.3 to 2 can be reliably
detected if replicate experiments are performed. c, Different methods for preparing
labelled material for measurements of gene expression. The RNA can be labelled
directly, using a psoralen-biotin derivative or by ligation to an RNA molecule carrying
biotin; labelled nucleotides can be incorporated into cDNA during or after reverse
transcription of polyadenylated RNA; or cDNA can be generated that carries a T7
promoter at its 5' end. In the last case, the double-stranded cDNA serves as template
for a reverse transcription reaction in which labelled nucleotides are incorporated into
cRNA. Commonly used labels include the fluorophores fluorescein, Cy3 (or Cy5), or
nonfluorescent biotin, which is subsequently labelled by staining with a fluorescent
streptavidin conjugate. d, Two-colour hybridization strategy often used with cDNA
microarrays. cDNA from two different conditions is labelled with two different
fluorescent dyes (usually Cy3 and Cy5), and the two samples are co-hybridized to an
array. After washing, the array is scanned at two different wavelengths to detect the
relative transcript abundance for each condition. cDNA array image courtesy of J.
DeRisi and P. O. Brown (http://cmgm.stanford.edu/pbrown/yeastchip.html).

Figure 2 Messenger RNA abundance levels in different
cells, tissues and organisms. a, Human HIV-infected T
lymphocytes; b, mouse olfactory epithelium; c, rat brain;
d, S. cerevisiae strain RY136 grown at 25 °C in rich medium.
Levels of gene expression were measured using Affymetrix
oligonucleotide arrays. For human, mouse and rat samples,
hybridization intensities were converted to copies per cell (top
axis) based on the signal from multiple control RNAs added to
the samples at known concentrations. For yeast, the
conversion was based on the signal from the TATA-binding
protein (TBP) mRNA, which has been determined to be
present at ~3.5 copies per cell when yeast cells are grown in
rich medium. Only those genes scored as 'present' are
[...] containing probes for a different subset of genes and ESTs,
were combined to generate the plots for human (five arrays),
mouse (five arrays) and rat (three arrays). All yeast ORFs were
represented on a single array. For measurements that cover
such a large number of genes, it is important to maintain high
standards of data quality to keep false-positive results to a
minimum. (For example, when monitoring 10,000 genes,
even a low false-positive rate of 1% results in 100 false calls.)
We find that the source of most false positives (in large part
the result of setting the lowest possible thresholds in the
interest of sensitivity) is random noise, biological variation, or
the occasional array-specific physical defect, so observations
made consistently in independent replicates yield a

profile or the 'transcriptome', is a major determinant of cellular phenotype
and function. The transcription of genomic DNA to produce
mRNA is the first step in the process of protein synthesis, and
differences in gene expression are responsible for both morphological
and phenotypic differences as well as indicative of cellular responses to
environmental stimuli and perturbations. Unlike the genome, the
transcriptome is highly dynamic and changes rapidly and dramatically
in response to perturbations or even during normal cellular events
such as DNA replication and cell division. In terms of understanding
the function of genes, knowing when, where and to what extent a
gene is expressed is central to understanding the activity and biological
roles of its encoded protein. In addition, changes in the multi-gene
patterns of expression can provide clues about regulatory mechanisms
and broader cellular functions and biochemical pathways. In the
context of human health and treatment, the knowledge gained from
these types of measurements can help determine the causes and consequences
of disease, how drugs and drug candidates work in cells
and organisms, and what gene products might have therapeutic uses
themselves or may be appropriate targets for therapeutic intervention.

Past discussions of arrays have often centred on technical issues and
specific performance characteristics. Now that nucleic acid arrays have
been constructed for many different organisms and used successfully
to measure transcript abundance in a host of different experiments,
the focus of interest has thankfully shifted. Investigators are now
more concerned with questions concerning experimental design, data
analysis, the use of small amounts of mRNA from limited sources, the
best ways to extract biological meaning from the results, pathway and
cell-circuitry modelling, and medical uses of expression patterns.

Array-based gene expression monitoring 

One way to think of measurements with arrays is that they are simply
a more powerful substitute for conventional methods of evaluating
mRNA abundance. For some early experiments, only a relatively
small set of genes, which were thought to be important to a process,
were included on the arrays. However, such experiments did not
capitalize on the arrays' potential: a key advantage of using arrays,
especially those that contain probes for tens of thousands of different
genes, is that it is not necessary to guess what the important genes or
mechanisms are in advance. Instead of looking only under the
proverbial lamppost, a broader, more complete and less biased view
of the cellular response is obtained (Figs 2, 3).

The breadth of array-based observations almost guarantees that 
surprising findings will be made. A recent study measured the 
transcriptional changes that occur as cells progress through the 
normal cell-division cycle in humans for approximately 40,000 genes 
(R. J. Cho et al., unpublished results). In addition to the induction of
DNA replication genes and genes involved with cell-cycle control and 
chromosome segregation that would be expected at specific stages in 
the cell cycle, a large collection of genes involved with smooth muscle 
function, apoptosis and intercellular adhesion and cell motility were 
found to be upregulated during a specific phase. The expected results 
act effectively as internal controls that provide a certain amount of 
validation (and comfort), while new information is obtained by a 
systematic search of a larger part of 'gene space'. In addition, because
arrays often contain probes for genes of unknown function (and 
often with only partial sequence information), any outcome for these 
could be considered, in some sense, both surprising and novel 
(although clearly requiring further characterization). 

Other gene expression methods 

Not surprisingly, there are other ways to measure mRNA abundance, 
gene expression and changes in gene expression. For measuring gene 
expression at the level of mRNA, northern blots, polymerase chain 
reaction after reverse transcription of RNA (RT-PCR), nuclease 

Figure 3 Methods for analysing gene
expression data shown for measurements of
expression in the cell cycle of S. cerevisiae.
a, Yeast cells were synchronized and cells
were collected every ten minutes throughout
two complete synchronous cycles (18 time
points in total are shown). Expression data
were collected by hybridizing labelled cDNA
samples to high-density oligonucleotide
arrays. Transcript levels were determined for
almost every gene in the genome for every
time point. A sample of 409 genes (from a
total of 6,000) that showed both a significant
(more than twofold) fluctuation in transcript
levels during the time course and cell
cycle-dependent periodicity were selected for
further analysis. b, Dendrogram indicating
similarity of expression profiles, calculated
using the Pearson correlation function in the
[...] Genetics, San Carlos, CA). For display
purposes, the relative expression levels were
plotted in red (high) and blue (low). c, The
genes were divided into five different temporal

protection, cDNA sequencing, clone hybridization, differential
display, subtractive hybridization, cDNA fragment fingerprinting
and serial analysis of gene expression (SAGE) have all been put to good
use to measure the expression levels of specific genes, characterize
global expression profiles or to screen for significant differences in
mRNA abundance. But if messenger RNA is only an intermediate on
the way to production of the functional protein products, why measure
mRNA at all? One reason is simply that protein-based approaches are
generally more difficult, less sensitive and have a lower throughput than
RNA-based ones. But more importantly, mRNA levels are immensely
informative about cell state and the activity of genes, and for most
genes, changes in mRNA abundance are related to changes in protein
abundance. Because of its importance, however, many methods have
been developed for monitoring protein levels either directly or
indirectly (see review in this issue by Pandey and Mann, pages
837-846). These include western blots, two-dimensional gels, methods
based on protein or peptide chromatographic separation and mass
spectrometric detection, methods that use specific protein-fusion
reporter constructs and colorimetric readouts, and methods based
on characterization of actively translated, polysomal mRNA.

The importance of the protein-based methods is that they measure 
the final expression product rather than an intermediate. In addition, 
some of them enable the detection of post-translational protein modifications
(for example, phosphorylation and glycosylation) and protein
complexes, and in some cases, yield information about protein localization,
none of which are obtained directly by measurements of mRNA.
There is no question that protein- and RNA-based measurements are
complementary, and that protein-based methods are important as they
measure observables that are not readily detected in other ways.

Human disease, gene expression and discovery 

Genomics and gene expression experiments are sometimes derided 
as 'fishing expeditions'. Our view is that there is nothing wrong with a
fishing expedition if what you are after is 'fish', such as new genes
involved in a pathway, potential drug targets or expression markers
that can be used in a predictive or diagnostic fashion. Because the
arrays can be designed and made on the basis of only partial
sequence information, it is possible to include genes in a survey that
are completely uncharacterized. In many ways, the spirit of this
approach is more akin to that of classical genetics in which mutations
are made broadly and at random (not only in specific genes),
and screens or selections are set up to discover mutants with an
interesting phenotype, which then leads to further characterization
of specific genes.

Such broad discovery experiments are probably better described
as 'question-driven' rather than hypothesis-driven in the conventional
sense. But that is not to diminish their value for understanding
basic biological processes and even for understanding and treating
human disease. For example, by analysing multiple samples obtained
from individuals with and without acute leukaemia or diffuse large
B-cell lymphoma, gene expression (mRNA) markers were discovered
that could be used in the classification of these cancers. The
importance of monitoring a large number of genes was well illustrated
in these studies. Golub et al. found that reliable predictions could
not be made based on any single gene, but that predictions based on
the expression levels of 50 genes (selected from the more than 6,000
monitored on the arrays) were highly accurate. The results of both of
these studies indicate that measurements with more individuals and 
more genes will be needed to identify robust expression markers that 
are predictive of clinical outcome. But even with the limited initial 
data it was possible to help clarify an unusual case (classic leukaemia 
presentation but atypical morphology) and to use this information 
to guide the patient's clinical care. 

It is also possible to take a related approach to help understand 
what goes wrong in cancerous, transformed cells and to identify 
the genes responsible for disease. Causative effects and potential 
therapeutic targets can be identified by determining which genes are
upregulated in different tumour types, and specific candidate
genes can be intentionally overexpressed in cell lines or cells treated
with growth factors in order to identify downstream target genes
and to explore signalling pathways. Tumorigenesis is often
accompanied by changes in chromosomal DNA, such as genetic 
rearrangements, amplifications or losses of particular chromosomal 
loci, and developmental abnormalities, such as Down's or Turner's 
syndrome, may arise from aberrations in DNA copy number. 
Because genomic DNA can be interrogated in much the same way as 
mRNA, comparisons of the copy number of genomic regions or the 
genotype of genetic markers can be used to detect chromosomal 
regions and genes that are amplified or deleted in cancerous or 
pre-cancerous cells. By using arrays containing probes for a large 
number of genes or polymorphic markers, changes in DNA copy 
number have been detected in both breast cancer cell lines and in 
tumours. The identification of when and where changes in copy
number or chromosomal rearrangements have occurred can be used 
in both the classification of cancer types and the identification of 
regions that may harbour tumour-suppressor genes. 

Whole-genome hypotheses 

The use of genomics tools such as arrays does not, of course, preclude 
hypothesis-driven research. For fully sequenced organisms, arrays
containing probes for every annotated gene in the genome have been
produced. With these one can ask, for example, whether a
transcription factor has a global role in transcription (affecting all
genes) or a specific role (affecting only some). Holstege et al. used
this type of application in a genome-wide expression analysis in yeast
to functionally dissect the machinery of transcription initiation.
Similarly, genes located near the ends of chromosomes in yeast (as
well as genes at the mating-type locus) are known to be transcriptionally
'silent'. Full genome arrays allow the chromosomal landscape of
silencing to be mapped, and make it possible to test whether what is
true for a handful of well-studied genes near the telomeres is true for
all telomeric genes, and whether any centromere-proximal genes are
also transcriptionally silenced.

It is important to emphasize that these new, parallel approaches 
do not replace conventional methods. Standard methods such as 
northern blots, western blots or RT-PCR are simply used in a more 
targeted fashion to complement the broader measurements and to
follow up on the genes, pathways and mechanisms implicated by the
array results. Because the incidence of false-positive results can be 
made sufficiently low (see Fig. 2), it is not necessary to independently 
confirm every change for the results to be valid and trustworthy, 
especially if conclusions are based on changes in sets of genes rather 
than individual genes. More detailed follow-up is recommended if a 
gene is being chosen, for example, as a drug target, as a candidate for 
population genetics studies, or as the target for the construction of a 
knockout mouse. 
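
The effect of replication on false positives can be illustrated with a little arithmetic. The sketch below is only an illustration, not a method taken from the array studies discussed here: it assumes that false calls arise independently in each replicate and at the same rate for every gene, which real data (with biological variation and array-specific defects) will not satisfy exactly. The function name and its parameters are hypothetical.

```python
def expected_false_calls(n_genes, false_positive_rate, n_replicates=1):
    """Expected number of genes falsely called in every one of n_replicates.

    Assumption (for illustration only): false calls are independent across
    replicates and occur at the same rate for every gene.
    """
    if n_replicates < 1:
        raise ValueError("n_replicates must be at least 1")
    # A gene is a false call overall only if it is falsely called in each
    # replicate, so the per-gene probability is rate ** n_replicates.
    return n_genes * false_positive_rate ** n_replicates


if __name__ == "__main__":
    for replicates in (1, 2, 3):
        calls = expected_false_calls(10_000, 0.01, replicates)
        print(f"{replicates} replicate(s): about {calls:g} expected false calls")
```

Under these assumptions, 10,000 genes monitored at a 1% false-positive rate give about 100 expected false calls in a single experiment, whereas requiring the same call in two independent replicates brings the expectation down to about one.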

Does gene expression indicate function? 

As additional, uncharacterized open reading frames (ORFs) are
identified in different organisms by the various genome sequencing
projects, researchers have begun to ask whether the expression pattern
for a gene can be used to predict the functional role of its protein
product. An increasingly common approach involves using the gene
expression behaviour observed over multiple experiments to first
cluster genes together into groups (see Fig. 3), either by manual
examination of the data, or by using statistical methods such as
self-organizing maps, K-tuple means clustering or hierarchical
clustering. The basic assumption underlying this approach is that
genes with similar expression behaviour (for example, increasing
and decreasing together under similar circumstances) are likely to be
related functionally. In this way, genes without previous functional
assignments can be given tentative assignments or assigned a role in a
biological process based on the known functions of genes in the same

NATURE | VOL 405 | 15 JUNE 2000 | www.nature.com

[Figure key, functional categories: Signal transduction; Cellular biogenesis; Intracellular transport; Transport facilitation; Protein destination; Protein synthesis; Transcription; Cell growth, division, DNA synthesis; Energy; Metabolism; Cellular organization]

© 2000 Macmillan Magazines Ltd



expression cluster (that is, the concept of 'guilt-by-association'). The
validity of this approach has been demonstrated for many genes in
Saccharomyces cerevisiae, a simple organism for which the entire
genomic sequence and the functional roles of approximately 60% of
the genes are known (Fig. 4). Although not logically rigorous,
the utility of the guilt-by-association approach has been demonstrated,
as genes already known to be related do, in fact, tend to cluster
together based on their experimentally determined expression patterns
(Fig. 4). The approach is made more systematic and statistically
sound by calculating the probability that the observed functional
distribution of differentially expressed genes could have happened by
chance. The application of statistical rigour is essential to avoid
overly subjective interpretations of the results based on the predispositions,
prior knowledge and interests of the individual researcher.
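
One common way to calculate such a probability is a hypergeometric tail test, sketched below. This is an illustrative choice rather than the specific statistic used in the studies cited above, and the function name, its parameters and the numbers in the usage example are all hypothetical.

```python
from math import comb


def enrichment_p_value(genome_size, category_size, cluster_size, overlap):
    """Probability of seeing at least `overlap` category members in a cluster.

    Models the cluster as `cluster_size` genes drawn at random, without
    replacement, from `genome_size` genes of which `category_size` belong to
    the functional category (a hypergeometric upper tail).
    """
    total = comb(genome_size, cluster_size)
    upper = min(category_size, cluster_size)
    tail = sum(
        comb(category_size, k) * comb(genome_size - category_size, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: a 6,000-gene genome, 120 genes annotated to one
    # category, and a 50-gene expression cluster containing 10 of them.
    p = enrichment_p_value(6000, 120, 50, 10)
    print(f"chance probability of 10 or more category genes: {p:.3g}")
```

A small value suggests that the functional category is over-represented in the cluster beyond what random membership would produce. When many categories and clusters are tested, a multiple-testing correction would also be needed, which this sketch omits.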

A tentative functional assignment may not be much more than a 
low-resolution description or general classification. Descriptions of 
this type are similar to those that come out of more classical genetic 
screens and selections, which have provided the vast majority of 
functional annotations to date — they indicate that genes are 
involved with a particular cellular phenotype and that they are likely 
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to be involved with a certain set of other genes and processes. This 
allows researchers to focus attention on a smaller subset of genes, 
many of which may not have been obvious candidates in the absence 
of the global expression observations. This overall approach highlights
the importance of functional annotation and careful curation
of existing sequence, function and knowledge databases (see below).
Expression results covering thousands or even tens of thousands 
of genes and expressed sequence tags (ESTs) will be only partly 
interpretable given the functional and biological information 
available at the time they are initially generated. Our ability to extract 
knowledge from measurements of global gene expression tends to 
increase with time as additional information becomes available, and 
results can be subjected to further interrogation in the light of new 
information, observations, questions and hypotheses. 

Gene expression and the regulation of transcription 

When information on the complete genome sequence is available, as 
is the case for increasing numbers of small and even larger genomes, 
gene expression data can be used to identify new cis-regulatory 
elements (genomic sequence motifs that are over-represented in the 
genomic DNA in the vicinity of similarly behaving genes) and 
'regulons' (sets of co-regulated genes), the basic units of the underly- 
ing cellular circuitry (Fig. 3d). In fact, the correlation between the 
presence of specific sequence motifs in promoter regions and gene
expression patterns may be stronger than the correlation between
functional categories and gene expression patterns. In yeast studies,
more than 50% of the genes that are transcribed in a cell cycle-specific
manner and whose transcript abundance peaks in the G1
phase of the cell cycle have an MCB (Mlu cell-cycle box) within 500
base pairs (bp) of their translational start site. Similar observations
have been made for yeast genes whose transcription is induced
during sporulation. In addition, new cis-regulatory elements may
be revealed by examining classes of co-regulated genes (Fig. 3d). With
sufficiently large numbers of experimental observations of expression
behaviour, the boundaries and all functioning sequence variants
of cis-regulatory elements might be predicted without the need for
the more conventional approach using site-directed mutagenesis
('promoter bashing'). The expression-based method will be especially
valuable in exotic organisms, such as Plasmodium falciparum, the
causative agent for malaria, for which experimental identification or 
verification of transcription factor binding sites is difficult. 
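The idea of looking for motifs that are over-represented near similarly
behaving genes can be sketched as follows. This is a minimal illustration,
not the method used in the cited studies. It assumes the MCB core is ACGCGT
(the MluI recognition sequence), and the gene names and sequences are
invented toy data, far shorter than real promoters.

```python
def has_motif(upstream_seq, motif="ACGCGT", window=500):
    """True if the motif occurs in the last `window` bases before the start site.

    `upstream_seq` is assumed to be written 5'->3' and to end at the
    translational start site, so the window is the tail of the string.
    ACGCGT is palindromic, so only one strand needs scanning.
    """
    return motif in upstream_seq.upper()[-window:]

def motif_fraction(promoters, motif="ACGCGT", window=500):
    """Fraction of promoters (dict: gene name -> upstream sequence) with the motif."""
    if not promoters:
        return 0.0
    hits = sum(has_motif(seq, motif, window) for seq in promoters.values())
    return hits / len(promoters)

# Toy, invented sequences for a co-regulated cluster and a background set.
g1_cluster = {
    "geneA": "TTGACGCGTAAATG",
    "geneB": "CCACGCGTTTTTGA",
    "geneC": "GGGTTTAAACCCTT",
}
background = {
    "geneD": "ATATATATATATAT",
    "geneE": "GCGCGCATATTTAA",
    "geneF": "TTACGCGTCCCGGA",
    "geneG": "CCCCCCGGGGGGTT",
}

print(f"cluster:    {motif_fraction(g1_cluster):.2f}")
print(f"background: {motif_fraction(background):.2f}")
```

A motif that appears in a clearly larger fraction of the cluster than of the
background is a candidate regulatory element; in practice the difference
would still need a statistical test and experimental confirmation.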

Gene expression profiles as 'fingerprints' 

An often overlooked aspect of measurements of global gene expres- 
sion is that the sequence or even the origin of the arrayed probes does 
not need to be known to make interesting observations — the 
complex profiles, consisting of thousands of individual observations, 
can serve as transcriptional 'fingerprints'. The fingerprints can be
used for classification purposes or as tests for relatedness, in a similar
manner to the way in which DNA fingerprints are used in paternity
testing. In one example, transcriptional fingerprints have been used
to determine the target of a drug; the basic idea is that if a drug
interacts with and inactivates a specific cellular protein, the pheno- 
type of the drug-treated cell should be very similar to the phenotype 
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of a cell in which the gene encoding the protein has been genetically 
inactivated, usually through mutation. Thus, by comparing the 
expression profile of a drug-treated cell to the profiles of cells in 
which single genes have been individually inactivated, specific 
mutants can be matched to specific drugs, and therefore, targets to 
drugs. In a demonstration of this concept, the gene product of the
his3 gene was identified correctly as the target of 3-aminotriazole.
Similarly, profiles have been used in the classification of cancers and
the classification schemes did not depend on any specific information
about the genes involved, although that information can be
used to draw further biological and mechanistic conclusions. Finally, 
expression profiles can be used to classify drugs and their mode of 
action. For example, the functional similarity and specificity of 
different purine analogues have been determined by comparing the 
genome-wide effects on treated yeast, murine and human cells.
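The matching of a drug-treated profile to single-gene mutant profiles can be
sketched with a simple similarity score. The example below uses the Pearson
correlation coefficient, which is an assumption chosen for illustration
rather than the measure used in the cited work. The mutant names and
expression ratios are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_matching_mutant(drug_profile, mutant_profiles):
    """Return (mutant name, correlation) for the most similar mutant profile."""
    scores = {name: pearson(drug_profile, prof)
              for name, prof in mutant_profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Invented log2 expression ratios (treated or mutant vs. wild type) for six genes.
drug = [2.1, -0.3, 1.8, 0.1, -1.5, 0.4]
mutants = {
    "mutant_1": [1.9, -0.1, 2.0, 0.0, -1.2, 0.3],   # resembles the drug profile
    "mutant_2": [-0.2, 1.5, 0.1, -1.8, 0.3, 0.9],
    "mutant_3": [0.0, 0.2, -0.1, 0.1, 0.0, -0.2],
}

name, r = best_matching_mutant(drug, mutants)
print(name, round(r, 3))
```

Note that nothing in the comparison requires knowing what the six genes are;
the profile as a whole acts as the fingerprint.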

Expression measurements from small amounts of RNA 

An important frontier in the development of gene expression
technology involves reduction of the required amount of starting
material. Most array-based expression measurements are done using
RNA from a million or more cells, and obtaining such a relatively
large sample is not a problem in many types of studies (for example,
litres of yeast cells can be grown easily). However, in some cases, it is
important or even necessary to use fewer cells, as when using a small
organ from a fly or worm, sorted cells that express a rare marker, or
laser-capture microdissected tumour tissue. Efficient and
reproducible mRNA amplification methods are required, and there
are two primary approaches that show significant promise. The first
is a PCR-based approach that has been used to make single-cell cDNA
libraries. We have found that the amplification is efficient and
reproducible, but that the relative abundance of the cDNA products
is not well correlated with the original mRNA levels (D. Giang and
D. J. Lockhart, unpublished results), although normalization
and referencing strategies can be used (D. de Graaf and E. Lander,
personal communication).

The second approach avoids PCR altogether and uses multiple
rounds of linear amplification based on cDNA synthesis and a
template-directed in vitro transcription (IVT) reaction. This
method has been used to characterize mRNA from single live
neurons and even subcellular regions, and more recently to amplify
mRNA from 500 to 1,000 cells from microdissected brain tissues for
hybridization to spotted cDNA arrays. We have found that the
multiple-round cDNA/IVT amplification method produces sufficient
quantities of labelled material starting with as little as 1-50 ng of
total RNA, is highly reproducible (correlation coefficients greater
than 0.97), and introduces much less quantitative bias than
PCR-based amplification (D. Giang and D. J. Lockhart, unpublished
results). These amplification methods make it possible to
monitor large numbers of genes starting with very limited
amounts of RNA and very few cells. The combination of arrays
and powerful amplification strategies promises to be especially 
important for studies that use human biopsy material from 
inhomogeneous tissue, and in the areas of developmental biology, 
immunology and neurobiology. 

Genome analysis using arrays 

Although nucleic acid arrays are often equated with gene expression 
analysis, they may be used to collect much of the data that are 
obtained presently by Southern or northern blot hybridization tech- 
niques, but in a more highly parallel fashion (Figs 5, 6). Their utility 
in polymorphism detection and genotyping is described elsewhere 
(see review in this issue by Roses, pages 857-865), but there are many
additional uses for these versatile tools. For example, genomic DNA 
samples can be manipulated experimentally to select for particular 
regions before hybridization to obtain specific types of information. 
In yeast, the location of hundreds of chromosomal origins of replication
can be determined in parallel by enriching for early-replicating
regions using a variation of the Meselson-Stahl procedure and then
hybridizing the resulting DNA to full genome arrays (E. A. Winzeler
et al., unpublished results). Similarly, as probes for more intergenic
regions are synthesized on arrays, it becomes possible to identify
protein-binding sites: fragmented chromatin can be crosslinked to a
protein and then immunoprecipitated with an antibody to that protein.
The DNA fraction of the immunoprecipitate can be labelled and
hybridized to identify the approximate location of the binding site. In
addition, full genome arrays can be used in the analysis of plasmid
libraries in genetic selections such as two-hybrid screens or, in
principle, for any other type of experiment in which the information
is contained in the form of RNA or DNA. Arrays also have 
applications in biophysical chemistry and biochemistry. For 
example, single-stranded DNA arrays were converted enzymatically 
into arrays of double-stranded DNA to characterize the interactions 
of proteins, and potentially other types of molecules, with
double-stranded DNA.

Gene expression and cell circuitry 

Is it reasonable to consider the cell as a complex analogue circuit, and 
to attempt to reverse-engineer the cell circuitry much as an electrical
engineer would do by measuring currents and voltages at a variety
of nodes and under a variety of input conditions? In the case of 
the cell, expression levels and expression changes might take the place 
of electrical measurements, and could be measured under many 
experimental conditions. Is it possible that a genetic or cellular circuit 
of reasonable complexity could be adequately decoded or modelled, 
and if so, how many and what types of measurements and perturbations
(or 'inputs') would be required so that the problem was not
hopelessly underdetermined? Reasonably detailed circuit
diagrams can be drawn and simulations of simple genetic circuits 
have been performed for systems of low complexity (for example, the 
lytic cycle of phage lambda, and simple control networks in 
Escherichia coli bacteria). But the situation is considerably more
complex in the case of a eukaryotic cell. Using yeast as an example, if 
we assume that the expression level for each gene can be one of only 
four levels (off, low, medium or high), then if the 6,200 yeast genes
behave independently, there are 4^6,200, or roughly 6 x 10^3,732, possible
expression states. Of course, the expression levels of different genes 
are not all independent of one another, and there are some states that 
are physically unrealistic (for example, all genes off or all genes 
'high'), but the number of possible cellular configurations is very 
large. In addition, coupling between circuit components, the effects 
of nonlinear feedback, redundancy and even noise and stochastic 
events make simulating a circuit of this complexity a rather daunting 
task, and not all relationships and cellular events are reflected at the 
level of mRNA abundance. 
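The size of the state space under the four-level, 6,200-gene assumption can
be checked directly. The short sketch below computes 4^6,200 exactly and
reports its order of magnitude; the variable names are illustrative only.

```python
from math import log10

LEVELS = 4      # off, low, medium, high
GENES = 6200    # approximate yeast gene count used in the text

states = LEVELS ** GENES                 # exact; Python integers are unbounded
exponent = int(GENES * log10(LEVELS))    # power of ten of the leading digit
mantissa = 10 ** (GENES * log10(LEVELS) - exponent)

print(f"{LEVELS}^{GENES} has {len(str(states))} decimal digits")
print(f"approximately {mantissa:.1f} x 10^{exponent}")
```

The result has 3,733 decimal digits, that is, about 5.9 x 10^3,732 states,
which makes clear why exhaustive enumeration of expression states is not a
practical route to decoding the circuit.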

Least clear may be what types of perturbations or inputs are likely 
to be the most informative in terms of defining the relationships 
between genes and pathways, and what might be a minimal set of 
'orthogonal perturbations' (treatments, genetic manipulations or
growth conditions that have minimal overlap in their direct cellular 
effects). Certainly it is possible to delete every yeast gene one at a time 
(or even several at a time) and measure the expression profile for each 
mutant strain under a set of different growth conditions. It is
also possible to grow yeast on a matrix of thousands of different 
conditions and measure the resulting expression profiles for a range 
of mutated strains. It is clear that extensive experiments of this type, 
combined with information from other measurements such as
yeast two-hybrid protein-protein interaction screens, and
measurements of protein levels, modification states and cellular
localization, will lead to useful groupings of genes in terms of function
and regulation (that is, a genetic, molecular and functional taxonomy),
and will supply some reasonably detailed information about the
relationships between certain genes and pathways. In addition, sets
of perturbations directed towards specific functions and
cellular processes will allow higher-resolution and even mechanistic
information for significant parts of the overall circuitry. However,
given the tremendous complexity of the system, it is unlikely that a 
complete and detailed cellular circuit diagram will result for even 
single-celled eukaryotes such as yeast any time in the near future. But 
that is not to say that construction of even first-order global models 
and semi-quantitative circuit diagrams is not extremely useful. Such 
models serve to organize current information, relationships and 
hypotheses, and can be tremendously helpful for testing new 
hypotheses, interpreting new observations, designing new experiments
and predicting the likely effects of particular chemical, genetic
or cellular perturbations. They also serve as a scaffold upon which to
build higher-resolution, more quantitative and complete models.



Can we have too much data?

Contrary to what is sometimes thought, the biggest problem for
making sense of the extensive results from genomics experiments is
not that there is too much data or that there are insufficiently sophisticated
algorithms and software tools for querying and visualizing
data on this scale. Larger problems of data management and analysis
have been solved by airlines, financial institutions, global retailers,
high-energy and plasma physicists, the military and global weather
predictors, among others. It is often beneficial to have a large number
of measurements and sometimes more data make it possible to
analyse results that might otherwise have been too 'messy', and to
detect patterns and relationships that would not have been obvious
or have sufficient statistical significance with smaller data sets. In
many types of studies, it is not possible to control completely all
variables, and the individual differences between common sample
types may be significant because of experimental difficulties (for
example, tissue inhomogeneity or variations in sample procedures)
or individual genetic variation (for example, different patients or
different tumours). But such factors do not preclude the discovery of
some genes that clearly 'cluster' or differentiate between the sample
sets. For example, meaningful results can be extracted from the
analysis of human tissue collected at different hospitals, by different
surgeons and at different times. An essential requirement in these
types of studies is that a sufficient number of experiments be
performed across multiple individuals and multiple tissue or tumour
samples to account for individual variation and possible tissue
inhomogeneity. Furthermore, confidence in the results is increased
as conclusions are based on sets of genes that show a consistent
response and that are consistently different between two or more sets
of results.

Making sense of genomic results

Although the difficulties of sample collection, data collection and 
experimental design should not be underestimated, one of the most 
challenging aspects of gene expression analysis is making sense of the 
vast quantities of data and extracting conclusions and hypotheses that 
are biologically meaningful. From experiments on global gene expres- 
sion, we may obtain data for thousands of genes, often forcing us to 
consider processes, functions and mechanisms about which we know 
very little. Thus, there is a need for more sophisticated systems of 
knowledge representation (or 'knowledge bases') that organize the
data, facts, observations, relationships and even hypotheses that form
the basis of our current scientific understanding. This information
needs to be more than just stored; it needs to be available in a
way that helps scientists understand and interpret the often
complex observations that are becoming increasingly easy to make.
Unfortunately, the fact is that the scientific literature has been
somewhat haphazardly built, without the benefit of a controlled or
restricted vocabulary and well defined semantics and grammar. To
take full advantage of the abilities of the new technologies and the
rapidly increasing amount of sequence information it is absolutely
essential to incorporate the facts, ideas, connections, observations
and so forth, which exist in the scientific literature and in the
minds of scientists, into a form that is systematic, organized, 
linked, visualized and searchable. This clearly requires a great 
deal of dedicated, systematic human effort, but progress has 
been made. Databases such as the Saccharomyces Genome
Database (SGD: genome-www.stanford.edu/Saccharomyces), the 
Munich Information Center for Protein Sequences (MIPS: 
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www.mips.biochem.mpg.de), WormBase (www.wormbase.org), the 
Kyoto Encyclopedia of Genes and Genomes (KEGG: 
www.genome.ad.jp/kegg), the Encyclopedia of E. coli Genes and
Metabolism (EcoCyc: http://ecocycpanbio.com/ecocyc) and FlyBase
(flybase.bio.Indiana.edu/) incorporate sequence, genetics, gene
expression, homology, regulation, function and phenotype information
in an organized and useable form. But a step beyond databases
of this type are ones in which concepts as well as facts are more fully
integrated and related, allowing connections to be made between 
initially disparate observations and information, and across 
organisms. It is conceivable that the next step will evolve to the level of a
biological 'expert system', not unlike the expert system ('Deep Blue') that
IBM scientists and engineers built to play chess (successfully) against
the world's best chess player. Despite the potential for advancement
on this front, it seems unlikely that computational tools will ever
replace the trained human brain when it comes to making biological
sense of new results. However, the appropriate tools are needed to bring
information and relationships to scientists' fingertips so that the
most insightful questions can be asked and the most meaningful 
interpretations made. 

Conclusion 

For these array-based methods to become truly revolutionary, they 
must become an integral part of the daily activities of the typical 
molecular biology laboratory. Despite their impressive and rapidly 
growing résumé, these technologies are still in their infancy, with
plenty of room for technical improvements, further development, 
and more widespread acceptance and accessibility. We expect that the 
pattern of development and use of arrays and other parallel genomic 
methodologies will be similar to that seen for computers and other 
high-tech electronic devices, which started out as exotic and expen- 
sive tools in the hands of the few developers and early adopters, and 
then moved quickly to become easier to use, more available, less 
expensive and more powerful, both individually and because of their 
ubiquity. In fact, nucleic acid array-based methods that previously 
seemed exotic, and too expensive, are becoming routine as indicated 
by the huge increase in the number of publications that incorporate 
data obtained in this way. Despite the relative youth of these 
approaches, the achievement of technical goals that would have 
seemed like science fiction only a few years ago is now clearly in view. 
For example, we expect that measuring the expression level of essen- 
tially every gene (including variant splice forms) on an array or two 
starting with RNA from a small number of cells, or even a single cell, 
will soon be possible owing to advances in single-cell handling and 
RNA amplification methods, the output of large-scale sequencing 
efforts and achievable advances in array technology. In the future, 
arrays of peptides, proteins, small molecules, mRNAs, clones, tissues, 
cells and even multicellular organisms such as the nematode worm 
Caenorhabditis elegans may also become common. The combined 
use of all of these highly parallel methods, along with sequence 
information, computational tools, integrated knowledge databases, 
and the traditional approaches of biology, biochemistry, chemistry, 
physics, mathematics and genetics, increases the hopes of 
understanding the function and regulation of all genes and proteins, 
deciphering the underlying workings of the cell, determining the 
mechanisms of disease, and discovering ways to intervene with or 
prevent aberrant cellular processes in order to improve human health 
and well-being.
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